Admission to a university involves procedures and requirements that must be completed before a
student can be officially enrolled. Senior high school grades remain the most
significant factor in college admission decisions. This paper presents the use
of data mining to cluster students based on admission datasets. The admission
dataset for 2019-2020 was obtained from the office of student affairs and
services. This dataset contains 2,114 observations with 11 attributes. Data
preparation and data standardization were performed in the R programming
language to ensure that the dataset was ready for processing. The
optimal number of clusters (k) was identified using the silhouette method.
This method gave an optimal number of k=2, which was used in the actual
clustering using the k-means and hierarchical clustering algorithms. Both
algorithms were able to cluster students into two: cluster 1-social sciences or
board courses and cluster 2-management or non-board courses. Further, the
density-based spatial clustering of applications with noise (DBSCAN)
algorithm was also used on the same dataset and it yielded a single
cluster. This study can be replicated by using at least a 5-year dataset of
students' admission data and employing other algorithms that would suggest
students' retention and turnover to board examinations.
1. INTRODUCTION

In developing countries, improving the education sector is a principal interest of every government
institution. The main role of higher education institutions (HEIs) is to attain global competitiveness and
produce skilled human capital resources. Keeping standards in line with the latest labor market demands is
strenuous for any institution [1]. The implementation of the K-12 program has drastically changed
Philippine education perspectives, and HEIs have been facing difficulties admitting freshmen applicants and
incoming enrollees [2].

First-year college students in the Philippines are placed in their courses based on their senior high
school strand, grade point average (GPA), and entrance examination results. Meeting these criteria will ideally
lead students to the course that best suits them. However, failure to meet any of these criteria will cause a
course mismatch: a low GPA or low examination results will deny entry to board courses, and a strand
mismatch is not permitted. In these circumstances, clustering and cluster analysis
may be able to provide insights to admission officers as to where students will be placed. The use of data
mining techniques can help overcome the difficulties of forecasting students' enrollment preferences.

HEIs monitor which students qualify for particular programs and which students need further
assistance to be eligible to graduate. However, tracking the path of students in this way is difficult for most HEIs.
A viable way to counter these difficulties is through data analysis and data mining [3]. Data mining is a crucial
method of distinguishing and deciphering information from a large quantity of data. It is essential
for academic institutions to work with greater efficiency. Using data mining techniques, student results can be
predicted. The substantive feature of data mining is the analysis of a vast quantity of data to extract
obscure patterns through tasks such as cluster analysis, anomaly detection, and association rule mining [4].

Students desire to earn a degree from a reputable university and believe that finding the right advice
on which program to enroll in will not be an issue [5]. During college registration, students take an entrance
examination to qualify for their desired program. However, information about college programs is often
lacking, which causes students to be less focused in college. Students browse college websites for information
on their chosen program because there is no system that helps them determine their program.
A decision support system has, however, been provided to help students in their decision. This decision support
system is supported by density-based spatial clustering of applications with noise (DBSCAN), a method
used for clustering and grouping available data [6].

This study performed clustering of first-year college students using an admission dataset as implemented
in R and compared the results of three clustering algorithms: k-means, hierarchical, and DBSCAN.
For academic institutions seeking more effective technology to manage and support decision-making processes,
or to develop new policies and plan for better management of current processes [7],
clustering admission data can improve the quality of administrative decisions as they establish effective
admissions standards [8]. Moreover, new techniques for information discovery from databases used
in education systems may be applied to better decision-making [5], [9].


2. LITERATURE REVIEW 

Clustering is an unsupervised statistical data analysis technique where data is divided into subsets 
called clusters to find useful and hidden patterns [10], [11]. Additionally, cluster analysis allows a data scientist 
to look at the data from a different perspective without preconceived profiles [12]. In this paper, the terms 
clustering and cluster analysis are used interchangeably. Several techniques, including k-means, hierarchical, 
and DBSCAN, can be used to perform cluster analysis [10]. One of the most used algorithms is k-means, which
employs Euclidean distance as the dissimilarity measure and tries to minimize within-cluster distance and maximize
between-cluster distance [10], [13]. Given a dataset D of n items and the parameter k as input, k-means clustering
separates D into k groups. Because this split depends on the similarity measure, the resulting intra-cluster
similarity is high, while the inter-cluster similarity is low. Cluster similarity is
calculated using the mean value of the components in a cluster, which can be viewed as the cluster's mean [14].
On the other hand, hierarchical clustering joins items to create clusters based on the presence of similar 
qualities. Hierarchical clustering refers to a process in which clusters in a hierarchy merge with one another at 
specific distances [15]. It creates a hierarchy to decide cluster allocations. This is accomplished using either a 
top-down or bottom-up methodology. A dendrogram, which is a point hierarchy based on a tree, is the end 
product of these procedures [14]. Lastly, DBSCAN is a clustering algorithm that can be used with datasets 
containing noise points. It can pinpoint noise points and exclude them from the results [16]. By examining the 
local density of corresponding items, DBSCAN is able to locate clusters in a huge spatial dataset. The
DBSCAN algorithm has an edge over the k-means technique in that it can identify data points that are noise or
outliers, that is, points that do not belong to any cluster. Although it is slower than agglomerative clustering
and k-means, it still scales to relatively large datasets [14].
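
As an illustration of the k-means procedure described above, the following is a minimal sketch written in Python rather than the R used in this study. The function name `kmeans`, the toy points, and the choice of seeding the centroids with the first k points are assumptions made only for this example; they are not taken from the study.

```python
import math

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with Euclidean distance; returns (labels, centroids)."""
    # Assumption: the first k points seed the centroids (deterministic, for illustration).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
labels, centroids = kmeans(points, 2)
print(labels)
```

Each centroid is the mean of its members, which corresponds to the cluster mean used as the similarity reference in the description above.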

K-means clustering was employed in [17] to examine school entry data. It was
demonstrated to be efficient in identifying patterns and trends in the data and may be helpful in guiding
admissions decision-making. The authors assert that educational institutions may modify their admission
requirements to consider the traits of successful applicants. In forecasting students' admission status to higher
education based on their academic performance and other criteria, Santosa et al. [18] found that k-means
clustering is effective in predicting admission status and can be used by schools to identify the students who
are most likely to succeed in their programs. The admission process may be improved by classifying applicants
into groups based on their traits and suitability [19].

Iqbal et al. [20] analyzed data on school entry using the hierarchical clustering approach. They used
Ward's linkage method to cluster students based on their academic achievement, and the algorithm successfully
identified several groups of students with diverse academic profiles. The algorithm can be used to identify
students who need additional support or interventions to succeed academically. Goyal and Vohra [21] used a
hierarchical algorithm to analyze admissions data from higher education. They found that hierarchical
clustering can enhance the admissions process by identifying the traits of qualified applicants and successfully
classifying applicants into various groups based on their features. Bowers [22] used hierarchical clustering to
classify students based on test results and found that the technique successfully identified diverse student
clusters with distinct performance characteristics. The algorithm can be used to identify areas where students
may need extra assistance or to adjust admission requirements so that they more accurately identify applicants
with the potential to succeed.

Daniati [23] used the DBSCAN clustering approach to examine school entry data. The
study found that the algorithm was successful in identifying clusters of applicants with similar features by
classifying applicants based on their academic standing and demographic criteria. The author suggested
applying the algorithm to enhance the admissions process by identifying the qualities of successful applicants.
Nafuri et al. [24] examined admission data from higher education using DBSCAN clustering. The algorithm 
proved successful in separating different groups of candidates based on the criteria when it was used to cluster 
applicants based on their academic achievement and other characteristics. They believe that by identifying 
areas where applicants might benefit from additional help, the algorithm may be used to enhance the admissions 
process. The same method was effectively applied by [25] to identify various student groups with distinctive 
performance profiles based on the test results of the students. The algorithm might be used to identify areas 
where students might need extra assistance or to adjust admission requirements, so they better recognize 
applicants with the potential to succeed. 


3. METHOD 

Cluster analysis was performed on an admission dataset to identify distinct groups among the students. 
The following steps were undertaken to perform cluster analysis of the admission dataset: 
a. Dataset description 

The admission dataset for academic year 2019-2020 was obtained from the Office of Student Affairs 
and Services of Cavite State University-Silang Campus. The dataset comprises 2,114 observations with 11 
attributes. Seven of these attributes are categorical (last name, given name, MI, sex, strand, municipality, and 
course) while four attributes are numerical (GPA, VAT, MAT, and total). Figure 1 shows a portion of the dataset,
though some information has been blurred for data privacy.


Sex  Municipality  Strand       GPA          VAT  MAT  Total  Course
F    DASMARINAS    5            86.00        31   35   66     PSYCHOLOGY
M    DASMARINAS    7            89.50        31   27   58     HM
M    SILANG        6            85.25        39   31   70     BM
F    DASMARINAS    6            87.00        41   37   78     BM
F    DASMARINAS    4            92.75        41   19   60     BSE-ENGLISH
F    BACOOR        7            92.36        45   30   75     TM
F    GMA           5            90.72        39   17   56     PSYCHOLOGY
F    DASMARINAS    5            85.91        45   29   74     PSYCHOLOGY
F    DASMARINAS    [illegible]  87.24        47   25   72     TM
F    DASMARINAS    5            91.50        30   24   54     PSYCHOLOGY
F    SILANG        2            86.00        44   30   74     BM
F    DASMARINAS    6            [illegible]  31   15   46     BM
F    DASMARINAS    5            84.79        39   22   61     PSYCHOLOGY

(The last name, given name, and middle initial columns are blurred.)


Figure 1. Admission dataset 


b. Nominal data coding 

The strand attribute is represented by a numeric code, which follows the technique presented in [26]. 
In this technique, elements are assigned cardinal values based on their frequency, with higher frequency 
elements receiving higher cardinal values. The code is presented in Figure 2. 
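
The frequency-based coding idea can be sketched as follows. This is only an illustration in Python, not the procedure of [26] or the code used in this study: the function name `frequency_codes`, the alphabetical tie-breaking rule, and the sample strand values are assumptions.

```python
from collections import Counter

def frequency_codes(values):
    """Map each category to a rank code: 1 = least frequent, higher = more frequent."""
    counts = Counter(values)
    # Sort by count, then by name so that ties are broken deterministically (an assumption).
    ordered = sorted(counts, key=lambda name: (counts[name], name))
    return {name: rank for rank, name in enumerate(ordered, start=1)}

# Hypothetical sample of strand values, not the actual admission data.
strands = ["HUMSS", "TVL", "TVL", "ABM", "TVL", "HUMSS"]
codes = frequency_codes(strands)
coded_strands = [codes[s] for s in strands]
print(codes)
```

In this sample, the most frequent strand receives the highest code, matching the rule that higher frequency elements receive higher cardinal values.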
c. Data cleaning 

Data cleaning removed noisy or invalid data. In this step, the last name, given name, MI, sex, and
municipality attributes were removed. The cleaned dataset is shown in Figure 3.

Code  Strand  Freq
5     HUMSS   431
7     TVL
4     GAS
6     ABM
3     STEM
1     ARTS
2     ALS


Figure 2. Codes for the strand attribute 


[Table image: a portion of the cleaned data with the columns Strand, GPA, VAT, MAT, Total, and Course]


Figure 3. Portion of the cleaned data


d. Standardizing the data


The dataset was standardized to bring the numerical variables to a common scale. Standardization
transforms each variable by subtracting its mean and dividing by its
standard deviation, ensuring that all variables have a similar influence on the clustering algorithm. This step is
crucial to prevent variables with larger scales from dominating the clustering results.
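
The standardization described above can be illustrated with a short Python sketch (the study itself was implemented in R). The function name `standardize` is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not state which form was used. The GPA values are the first five rows shown in Figure 1.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

gpa = [86.00, 89.50, 85.25, 87.00, 92.75]  # first five GPA values in Figure 1
z_gpa = standardize(gpa)
print([round(z, 4) for z in z_gpa])
```

After the transformation, the values have a mean of zero and a standard deviation of one, so a variable such as GPA (in the 80s and 90s) no longer outweighs a variable such as the strand code (1 to 7).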


Figure 4 shows the standardized data. 


[Table image: standardized values of strand, GPA, VAT, MAT, and Total for ten observations]


Figure 4. Portion of the standardized data


e. Identifying the optimal number of clusters for k-means and hierarchical clustering algorithms using the silhouette method


The silhouette method was used to identify the optimal number of clusters (k), which resulted in k=2.
Figure 5 shows the result of the silhouette method.


[Plot: "Optimal number of clusters"; x-axis: Number of clusters k]


Figure 5. Optimal number of clusters using the silhouette method


These steps ensured the dataset was prepared and streamlined for cluster analysis, allowing for the 
identification of meaningful groups or clusters among the students. 
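
The silhouette criterion used in the last step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R routine used in the study: the function name `mean_silhouette`, the toy points, and the two candidate labelings are hypothetical. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of another cluster; the silhouette width is (b - a) / max(a, b), and the candidate k with the highest average width is preferred.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data and two candidate partitions (k=2 and k=3).
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
score_k2 = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
score_k3 = mean_silhouette(points, [0, 1, 0, 1, 2, 1])
print(score_k2 > score_k3)
```

In this toy example the two-cluster partition has the higher average silhouette width, which is the sense in which the method selects an optimal k.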


4. RESULTS AND DISCUSSION 
4.1. Cluster analysis using k-means 


The dataset was clustered using k=2. Figure 6 shows the result of k-means clustering. The cluster plot
for k-means applied to the dataset is shown in Figure 7. Applying the k-means algorithm to the admission
dataset for academic year 2019-2020 grouped the first-year college students into two
clusters. Based on the cluster plot, these clusters can be described as: i) cluster 1-social science courses, putting
together the education and psychology courses and ii) cluster 2-management courses, grouping together business
management, hotel management, tourism management, and information technology. Alternatively, the clusters
can be viewed as board courses (cluster 1) and non-board courses (cluster 2). The findings suggest
that students enroll in social science classes to gain knowledge and become more aware of the world. Social
science and education classes have an effect on the environment in which the students live. As a result, courses
in education and social science were placed by k-means in the same cluster. The expectation
that enrolled students gain knowledge and skills that will contribute to and influence the
socialization of society indicates the commonalities between these courses' objectives and impact on
society.


Cluster means:
      strand         GPA        VAT        MAT      Total
1 -0.4288269 -0.06165315  0.6196285  0.6756900  0.8041471
2  0.3652970  0.05251935 -0.5278317 -0.5755878 -0.6850142

Within cluster sum of squares by cluster:
[1] 182.8729 169.4553
 (between_ss / total_ss = 28.8 %)


Figure 6. Result of k-means clustering 


[Cluster plot: points labeled by course (e.g., PSYCHOLOGY, BM) and grouped by cluster (1, 2); axes Dim1 (48.1%) and Dim2 (20%)]


Figure 7. Cluster plot of k-means 


Based on the similarity of their traits in the admission dataset, the clustering algorithm divided the
first-year college students into two clusters. The traits that were employed for clustering form the
basis of the final groupings. Based on the cluster plot, it appears that the clustering algorithm in this instance
used the students' course preferences as the main feature: it placed the students who selected psychology and
education into one cluster and the students who selected business management, hotel management, tourism
management, and information technology into a different cluster. The resulting
clusters can be interpreted as grouping students who have similar academic interests or career goals. For
example, cluster 1 (social science courses) could be interpreted as students who are interested in pursuing
careers in education or psychology, while cluster 2 (management courses) could be interpreted as students who
are interested in business or technology-related careers. The clusters can also be thought of as putting students
into groups based on whether or not they have selected board courses. Board courses generally refer to courses
whose graduates must pass a certain board exam, like one in engineering or medicine. The two groups that were
discovered may have been the result of the clustering method using the board courses as a characteristic. Overall,
the clustering outcome may be caused by the admission dataset's combination of several variables that indicate
students' interests, objectives, and academic backgrounds. The specific features the clustering algorithm uses
and the interpretation of the generated clusters may change depending on the dataset and the particular
clustering technique employed.
4.2. Using hierarchical clustering

The dataset was clustered using k=2. Figure 8 shows the result of hierarchical clustering presented in
a cluster dendrogram. The cluster plot for hierarchical clustering applied to the dataset is shown in Figure 9.
The hierarchical clustering algorithm obtained the same result as the k-means
algorithm. Cluster 1 is composed of board courses, while cluster 2 is composed of non-board courses.
Academic courses that share similar prerequisites and aim to help students pursue and deepen their
understanding of business are grouped together as business courses in cluster 2. Courses in this cluster share
the qualities expected of registrants, including standards that help them grow in terms of project
management, leadership, and critical and strategic thinking.


[Dendrogram: "Cluster Dendrogram", produced by hclust (*, "complete")]


Figure 8. Cluster dendrogram 


[Cluster plot: points grouped by cluster; axes Dim1 (48.1%) and Dim2 (20%)]


Figure 9. Cluster plot of hierarchical clustering 


Like the k-means approach, the hierarchical clustering algorithm divided the first-year college
students into two clusters based on the admission dataset's commonalities. In the hierarchical clustering
process, groups are iteratively merged or split according to how similar they are to one another. Based on the
cluster descriptions produced, it appears that both the k-means method and the hierarchical clustering algorithm
used the students' course selection as a characteristic. The method put all board courses in cluster 1 and
all non-board courses in cluster 2. Additionally, it appears from the cluster descriptions that the
algorithm determined shared traits or features among the courses in cluster 2. Specifically, it implies that
cluster 2 courses are designed to strengthen students' foundations for pursuing and extending their grasp of
business. Moreover, the courses in this cluster share characteristics expected of enrollees,
such as project management, leadership, and critical and strategic thinking. These interpretations of the
clustering results could be based on a greater comprehension of the academic programs and courses offered by
the college, as well as the professional objectives and interests of the students. It is possible that the clustering
algorithm picked up on these tendencies in the admission dataset and grouped the students accordingly. It is
important to note that these interpretations are based on the information provided and may not accurately reflect
the underlying trends in the data.
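
The bottom-up merging described in this section can be sketched in a few lines of Python (the study used R, where Figure 8 indicates complete linkage). The function name `complete_linkage` and the toy points are assumptions for illustration; the sketch repeatedly merges the two closest clusters until k clusters remain, which corresponds to cutting the dendrogram at k=2.

```python
import math

def complete_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest complete-linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
hc_labels = complete_linkage(points, 2)
print(hc_labels)
```

On well-separated data such as this toy example, the resulting partition coincides with what k-means would produce, which is consistent with the agreement between the two algorithms reported above.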


4.3. Using density-based spatial clustering of applications with noise 

The dataset was clustered using eps=0.8 and minPts=7. Figure 10 shows the result of DBSCAN clustering. The
cluster plot for DBSCAN applied to the dataset is shown in Figure 11. As previously stated, both k-means and
hierarchical clustering algorithms produced two clusters with similar cluster plots. In contrast, the DBSCAN
algorithm resulted in only one cluster. This indicates that there were not enough distinct dense regions in the
data for the algorithm to identify multiple clusters. This could be due to several factors, such as the choice of
density parameter, distance measure, or data quality. The clusters may not provide sufficient information to enable
administrators to determine how to allocate courses. This is because these clusters are based solely on the
courses that students choose to enroll in and do not take into account other factors such as students' academic
backgrounds, interests, or career objectives.


DBSCAN clustering for 2114 objects.
Parameters: eps = 0.8, minPts = 7
The clustering contains 1 cluster(s) and 213 noise points.

   0    1
 213 1901


Figure 10. Result of DBSCAN 


[Cluster plot: points grouped by cluster; axes Dim1 (45.7%) and Dim2 (20.1%)]


Figure 11. Cluster plot of DBSCAN 
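
To illustrate how DBSCAN can return a single cluster plus noise points, the following is a minimal Python sketch of the algorithm (the study used R). The function name `dbscan`, the toy points, and the value min_pts=3 are assumptions chosen to keep the example small; only eps=0.8 matches the parameter reported in Figure 10. As in Figure 10, label 0 denotes noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0 for noise, 1, 2, ... for clusters."""
    n = len(points)
    # Neighborhood of each point (the point itself counts toward min_pts).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = 0  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

# Hypothetical data: a chain of nearby points and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (5.0, 5.0)]
db_labels = dbscan(points, eps=0.8, min_pts=3)
print(db_labels)
```

Because neighboring points are chained together through core points, data without a clear low-density gap collapses into one cluster, while isolated points are reported as noise.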


Using clustering as a decision-support tool to guide students' course work is still a good idea. 
Clustering can be a useful method for identifying patterns and similarities in large datasets, which can then be 
used to make informed decisions about course offerings and student assignments. To ensure that the clustering 
results are meaningful and practical, administrators can consider incorporating other factors and data sources 
into the clustering algorithm, such as statistics on student performance or survey findings. 

The clustered results for board and non-board courses offer only a vague indication for choosing the
subjects that will be assigned to students. If course assignments rely on such a vague indication, the
university's courses will lose their significance
and distinctiveness. To preserve the value of the school's name, administrators should consider using
clustering as a decision-support tool in directing students' course work.


5. CONCLUSION 

The three clustering algorithms used, namely k-means, hierarchical, and DBSCAN,
yielded almost similar numbers of clusters and cluster plots: k-means and hierarchical clustering each produced
two clusters, while DBSCAN produced a single cluster. Their implementation in R showed that the
admission dataset for first-year college students can be divided into two clusters. In this light, cluster
analysis may be able to give insights to admission officers as to where students will be placed. The algorithms
utilized in this work are effective resources for studying data on student admissions. Schools can make
better decisions regarding the admissions process and determine the traits of successful student-applicants by
spotting patterns and trends in the data. The result of the study can aid other HEIs in making more efficient
judgments in enhancing their existing procedures and raising the bar for students' admission, assistance,
and support. It is helpful in predicting students' turnover to board examinations and retention, increasing
graduation rates, revising and enhancing curricula, and effectively assessing the performance of the
university.

Using the clustering methodology, admissions officers can classify prospective students into clusters 
according to their academic qualifications and personal interests. Informed judgments about course placement 
and counseling can be made by admissions staff using this data, which will eventually increase student success 
and retention. To enhance and change the curriculum, patterns and trends in the courses that students in each 
cluster choose to take can be discovered using the clustering analysis. With the information provided in the 
result of this study, the curriculum of educational institutions may be improved to better address the needs and 
interests of the students. The institution may evaluate the efficacy of its programs and policies and make 
data-driven decisions for improvement by following changes in the clustering patterns over time. Overall, it 
offers insightful information on the requirements and characteristics of university students. The institution may 
boost student achievement, improve operations, and improve its reputation and value by leveraging these 
insights to guide decision-making. For future work, other attributes such as locality, family income, and skills
may also be looked into.